Introduction to Open Data Science - Course Project

About the project

I began my PhD studies at the University of Helsinki this autumn, and I expect to use R as part of my research project in musicology, which employs a digital humanities approach. I have previously taken courses in statistics (including analysis in R) via the open universities of the University of Helsinki and the Hanken School of Economics. I recently got a new computer (and unfortunately lost most of my scripts from earlier courses), so I am making a fresh start with R now.

So far, I have most often processed my numbers in spreadsheets such as Excel and Google Sheets, and I have developed techniques for doing my calculations quickly in those programs. The downside is that the threshold for reaching for a spreadsheet has felt lower than for using R (and the quickest calculator is still Spotlight on macOS). During this course, I hope to learn some useful techniques for making sense of data quickly. Perhaps RStudio might become my program of choice some day. :)

The address of my GitHub repository is http://github.com/wilhelmkvist/IODS-project.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 28 23:29:27 2021"

RStudio Exercise 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Nov 28 23:29:27 2021"

1. Learning 2014 - an overview of the dataset

The dataset ‘Learning 2014’ contains data from a survey among students in the social sciences at University of Helsinki. The data was collected during an introductory course in statistics in late 2014 and early 2015. Participation was voluntary, but students were strongly encouraged to take part as they were reimbursed with extra points in the final exam. Respondents (N=183) were presented with a set of questions designed to reflect their attitudes towards statistics and university studies in general. Learning approaches were measured with questions designed to reflect deep, surface, and strategic approaches. The present dataset (a subset of a larger dataset) contains 166 observations (by all those who participated in the final exam) of 7 variables. In addition to mean scores reflecting attitude and learning approaches, the total points in the final exam are provided. Background information was provided on the respondents’ age and gender (female/male).
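
The 166 rows correspond to the respondents who actually attended the exam. As a rough, hypothetical sketch of the kind of filtering that produces this subset (the actual wrangling was done in a separate data script; lrn14 is a placeholder name for the full survey data frame):

library(dplyr)
# keep only respondents who attended the exam (non-zero exam points);
# lrn14 is a hypothetical name for the full survey data
learning2014 <- filter(lrn14, points > 0)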

setwd("~/Documents/IODS-project")
learning2014 <- 
read.table("data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Below is an overview of the variables. The course was clearly dominated by females, who account for roughly two out of three students. Note the age span from 17 to 55 years (mean 25.5 years, median 22 years).

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

2. Exploring the dataset graphically

The chart below outlines attitude, learning approaches and final exam score (points) by gender. Males generally scored higher on attitude and the deep learning approach, while females scored higher on the strategic and surface learning approaches. The distributions of final exam scores were roughly equal between genders, with only a few males outperforming females at the very top of the range. Although males were generally slightly older, age and final score were negatively rather than positively correlated for males (-0.24); for females the correlation was negligible (-0.02). The strongest correlation was between attitude and final score (0.44), and it was about equal for males and females. There was a slight correlation between attitude and the deep learning approach (0.11), somewhat more so among males (0.17). However, there was no correlation between the deep learning approach and final scores (-0.01). Apart from attitude, a strategic learning approach was positively correlated with points (0.15), while a surface learning approach was negatively correlated with final scores (-0.14).

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

3. Fitting a regression model

Based on the correlations between variables (analyzed above), I choose to fit a regression model in order to understand the impact of age, attitude and learning approach on exam points. As it turns out, the student’s attitude is highly significant (p < 0.001), while age and a strategic learning approach fall just short of the conventional 5 per cent level (p = 0.0981 and 0.0621 respectively).

options(scipen = 999) # avoid scientific notation so that very small p-values are printed out in full
model1 <- lm(points ~ age + attitude + stra, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 0.00006171726 ***
## age         -0.08822    0.05302  -1.664        0.0981 .  
## attitude     3.48077    0.56220   6.191 0.00000000472 ***
## stra         1.00371    0.53434   1.878        0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 0.0000000107

For the sake of comparison, I fit a second model. Excluding the weakly significant variables ‘age’ and ‘stra’ does not seem to make much difference: the intercept and attitude remain highly significant in both models. (I have not conducted a formal hypothesis test between the models; a sketch of such a comparison follows after the second model summary below.)

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 0.00000000195 ***
## attitude      3.5255     0.5674   6.214 0.00000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 0.000000004119
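
Although I did not carry out a formal test, the two nested models could be compared with an F-test. A minimal sketch, assuming model1 and model2 fitted above are still in memory:

# F-test comparing the simpler model (attitude only) against the fuller model
# (age + attitude + stra); a small p-value would favour keeping age and stra
anova(model2, model1)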

4. Predicting exam scores

I choose to proceed with the first model (model1). According to the model, exam scores can be predicted if we know the person’s age, attitude and score on the strategic learning approach. As stated earlier, age is slightly negatively correlated with exam score, while attitude is much more strongly positively correlated with it. A strategic learning approach is moderately positively correlated with exam scores.

Let us make two example calculations, one for a moderately motivated young undergraduate (“Mike”), another for a highly motivated late bloomer (“Ritva”). Mike, age 22, scores 2 out of 5 on both attitude and strategic learning. Ritva, age 55, scores 5 out of 5 on both. We can now predict their exam scores:

students <- c('Mike','Ritva')
age <- c(22, 55)
attitude <- c(2,5)
stra <- c(2,5)
new_data <- data.frame(students, age, attitude, stra)
predict(model1, newdata = new_data)
##        1        2 
## 17.92358 28.46580

Effectively, R here makes the following calculations using these coefficients:

model1$coefficients
## (Intercept)         age    attitude        stra 
## 10.89542805 -0.08821869  3.48076791  1.00371238

Calculating predictions using the coefficients can be done in the following manner. (Differences in decimals are due to rounding errors.)

#Mike:
10.89543+(-0.08822*22)+(3.48077*2)+(1.00371*2)
## [1] 17.92355
#Ritva:
10.89543+(-0.08822*55)+(3.48077*5)+(1.00371*5)
## [1] 28.46573

The R-squared value (0.2182) summarizes how well the regression model fits the data. It shows that the model explains about 22 per cent of the variance in exam points. (Put simply, roughly a fifth of the variation in exam scores is accounted for by the students’ age, attitude and strategic learning approach together.)

summary(model1)$r.squared
## [1] 0.218172

5. Graphical model validation

In the following, I produce three diagnostic plots to explore the validity of my model assumptions graphically. The residuals vs fitted values plot checks whether the errors have constant variance and no systematic pattern. The Q-Q plot checks whether the errors are normally distributed. The residuals vs leverage plot reveals whether any single observation has an unusually large influence on the model. In this case, the errors appear normally distributed, uncorrelated and of constant variance (sigma^2), and no severe outliers or high-leverage points are identified.

par(mfrow = c(2,2))
plot(model1, which = c(1,2,5))


RStudio Exercise 3

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Nov 28 23:29:35 2021"

1. Exploring alcohol consumption among Portuguese secondary school students - an overview of the Alc dataset

The Alc dataset provides data on alcohol consumption among Portuguese secondary school students aged 15 to 22 (mean 16.6 years). The data was collected using school reports and questionnaires. Attributes include student grades as well as demographic, social and school-related features. The Alc dataset was constructed by joining two separate datasets on student performance in two distinct subjects: Mathematics and Portuguese. As some students were known to appear in both sets, duplicates were identified and removed.

The Alc dataset features a total of 370 observations of 35 variables (listed below). Note: alc_use was calculated as the mean of workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc); high_use is a binary variable indicating whether the student’s average consumption (alc_use) exceeds 2. The data was used in a paper by Cortez and Silva (2008). More information on the dataset, along with attribute descriptions, can be found on the UCI Machine Learning Repository website.
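
As a minimal sketch of how these two derived columns could be computed with dplyr (assuming a joined data frame, here called alc, that already contains the Dalc and Walc columns; the actual wrangling was done in a separate data script):

library(dplyr)
# alc_use: mean of workday (Dalc) and weekend (Walc) consumption;
# high_use: TRUE when the average consumption exceeds 2
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2, high_use = alc_use > 2)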

alc <- read.table("data/alc.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

There are a number of variables that can be studied to understand student alcohol consumption. To get a quick overview of the content and distribution of each variable we can use the following code:

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

In order to find out which variables could possibly be statistically relevant, I begin by running a regression model explaining “alc_use” with all other variables (except for “Dalc”, “Walc” and “high_use” which are directly related to “alc_use”).

# exclude variables "Dalc", "Walc" and "high_use", which are directly related to "alc_use"
selvarnames <- names(alc) %in% c("Dalc","Walc","high_use")
alc2 <- alc[!selvarnames]
# create a formula where "alc_use" is explained by the remaining variables (stored in alc2)
fmla <- as.formula(paste("alc_use~",paste(names(alc2)[1:31],collapse="+")))
# create linear regression model and read the summary
model1 <- lm(data=alc2, formula = fmla)
summary(model1)
## 
## Call:
## lm(formula = fmla, data = alc2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65003 -0.49853 -0.08492  0.38197  2.87986 
## 
## Coefficients:
##                   Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)      -0.308078   0.928209  -0.332        0.740172    
## schoolMS         -0.206366   0.166819  -1.237        0.216943    
## sexM              0.453517   0.098064   4.625 0.0000053993241 ***
## age               0.067109   0.043688   1.536        0.125478    
## addressU         -0.232701   0.119806  -1.942        0.052951 .  
## famsizeLE3        0.159470   0.099574   1.602        0.110221    
## PstatusT          0.016944   0.148389   0.114        0.909161    
## Medu              0.021611   0.065772   0.329        0.742688    
## Fedu              0.055074   0.056854   0.969        0.333409    
## Mjobhealth       -0.252106   0.227986  -1.106        0.269622    
## Mjobother        -0.111991   0.146597  -0.764        0.445450    
## Mjobservices     -0.135048   0.166597  -0.811        0.418166    
## Mjobteacher      -0.066798   0.210155  -0.318        0.750799    
## Fjobhealth        0.066192   0.305988   0.216        0.828870    
## Fjobother         0.179164   0.224244   0.799        0.424887    
## Fjobservices      0.399310   0.231379   1.726        0.085325 .  
## Fjobteacher      -0.115857   0.271994  -0.426        0.670421    
## reasonhome        0.045740   0.113435   0.403        0.687044    
## reasonother       0.409977   0.164654   2.490        0.013270 *  
## reasonreputation  0.056384   0.117446   0.480        0.631488    
## guardianmother   -0.149224   0.108597  -1.374        0.170344    
## guardianother    -0.232964   0.235168  -0.991        0.322593    
## traveltime        0.152445   0.069143   2.205        0.028161 *  
## studytime        -0.115010   0.058181  -1.977        0.048903 *  
## schoolsupyes      0.021980   0.140610   0.156        0.875877    
## famsupyes        -0.057759   0.097938  -0.590        0.555763    
## activitiesyes    -0.185204   0.090385  -2.049        0.041249 *  
## nurseryyes       -0.261139   0.112122  -2.329        0.020461 *  
## higheryes         0.237373   0.241308   0.984        0.325989    
## internetyes       0.042568   0.131414   0.324        0.746201    
## romanticyes       0.052334   0.097849   0.535        0.593119    
## famrel           -0.178569   0.048777  -3.661        0.000293 ***
## freetime          0.069539   0.048266   1.441        0.150606    
## goout             0.293976   0.042174   6.971 0.0000000000173 ***
## health            0.056055   0.032523   1.724        0.085729 .  
## failures          0.169898   0.089496   1.898        0.058521 .  
## paidyes           0.287764   0.095692   3.007        0.002840 ** 
## absences          0.028011   0.008503   3.294        0.001094 ** 
## G1               -0.057193   0.037889  -1.509        0.132136    
## G2                0.042812   0.047726   0.897        0.370356    
## G3                0.006646   0.034650   0.192        0.848011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8055 on 329 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3485 
## F-statistic: 5.935 on 40 and 329 DF,  p-value: < 0.00000000000000022

From the regression model summary, I can read that male sex, the quality of family relationships and going out with friends are highly significant variables. The variables “paid” (extra paid classes in Math or Portuguese) and “absences” (number of school absences) are fairly significant. Also significant at the 5 per cent level are study time, travel time, whether the student attended nursery school and whether the student has taken part in extra-curricular activities.

In order to understand the relationship between high alcohol consumption and selected variables, I visualize the relationship between high alcohol consumption and sex, family relationships, going out and study time. For the analysis, I construct a subset comprising columns 2 (sex), 22 (family relationships), 24 (going out), 14 (study time) and 35 (high_use). The visualization is presented in sex-disaggregated form.

library(ggplot2)
library(GGally)
#construct subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use)
alcpairs <- alc[, c(2, 22,24,14,35)]
#plot the pairs
ggpairs(alcpairs, mapping = aes(col=sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the visual presentation, I can formulate the following working hypotheses for the select variables:

  • sex: men are more likely than women to be high consumers
  • famrel: students with high alcohol consumption generally score lower on the quality of family relationships
  • goout: high alcohol consumption is strongly related to the frequency of going out, especially for men
  • study time: students using much alcohol generally study fewer hours

The table below shows a summary of key statistics for heavy and moderate drinkers (high_use = TRUE/FALSE), looking at four variables (three numeric variables plus sex).

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel=mean(famrel), goout=mean(goout), studytime=mean(studytime))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count famrel goout studytime
##   <fct> <lgl>    <int>  <dbl> <dbl>     <dbl>
## 1 F     FALSE      154   3.92  2.95      2.34
## 2 F     TRUE        41   3.73  3.39      2   
## 3 M     FALSE      105   4.13  2.70      1.90
## 4 M     TRUE        70   3.79  3.93      1.63

The table above supports what has been stated previously: young Portuguese heavy drinkers generally go out more than their moderately drinking peers. They also spend less time studying and rate their family relations as worse.

To illustrate the relationship between family relations and heavy drinking, let’s draw a separate box plot. As the chart indicates, moderate consumers of alcohol have better family ties. The difference between groups is not, however, as big as the boxplot would initially suggest. Therefore, dots have been added to indicate mean values by group and sex.

The boxplot visualization indicates that students with high alcohol consumption score lower on the quality of family relationships, but the difference in mean values between groups is much smaller than the boxplot of this integer-valued variable suggests: the difference is only 0.34 points for males and 0.19 for females. In this respect, the boxplot is easily misleading.

NB! 1) In both groups there are outliers scoring very low on family relations. In general, however, respondents scored fairly high (median = 4). NB! 2) The figure says little about whether the quality of family relations is a cause or a consequence of high alcohol consumption.

# initialise a plot of high_use and famrel
g <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g + geom_boxplot() + ylab("family relation") + ggtitle("Quality of family relationships \n by alcohol consumption and sex") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun=mean)
## Warning: Removed 4 rows containing missing values (geom_segment).

We can justify the claim that the boxplot is misleading by adjusting the code for the previous table so that it also reports median values. Doing so shows that there is no difference in the median family relations score between the groups and sexes. Reporting medians also shows a difference in the “goout” variable, with heavy drinkers more inclined towards going out.

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel_mean=mean(famrel), famrel_median=median(famrel), goout_mean=mean(goout), goout_median=median(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 7
## # Groups:   sex [2]
##   sex   high_use count famrel_mean famrel_median goout_mean goout_median
##   <fct> <lgl>    <int>       <dbl>         <dbl>      <dbl>        <dbl>
## 1 F     FALSE      154        3.92             4       2.95            3
## 2 F     TRUE        41        3.73             4       3.39            4
## 3 M     FALSE      105        4.13             4       2.70            3
## 4 M     TRUE        70        3.79             4       3.93            4

Applying logistic regression to understand the relationship between high alcohol usage and select variables

In a logistic regression model, the computational target variable is the log of the odds. It follows that applying the exponential function to the fitted values gives the odds, and that the exponentials of the coefficients can be interpreted as odds ratios associated with a unit change (vs no change) in the corresponding explanatory variable (as explained in the course video on DataCamp).
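
To make the link concrete, here is a quick numerical check (a sketch; the coefficient value is taken from the model summary shown further below):

# log-odds coefficient for sexM from the fitted model (about 0.796)
b_sexM <- 0.7959
# exponentiating gives the odds ratio: the odds of high_use for male students are
# roughly 2.2 times those of female students, the other predictors held constant
exp(b_sexM)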

From this we see that, with an odds ratio of over 2, the odds of a “success” (that is, being a heavy drinker) are more than twice as high for males as for their female peers when controlling for family relations, going out and study time. The same goes for each one-unit increase on the goout scale. On the other hand, students with good family relations and students who spend much time studying have lower odds of being heavy drinkers: the odds ratios are only about 2/3 per one-unit increase in famrel and studytime respectively.

The confidence intervals presented in the two rightmost columns (2.5 % and 97.5 %) of the table below give an indication of the uncertainty of the estimates. We see, for instance, that the interval for sexM is wider than that for goout: although the two odds ratios are about the same, the estimate for sex is considerably less precise.

The data presented here largely supports my initial working hypotheses about men being more likely than women to be high consumers, about students with high alcohol consumption generally scoring lower on the quality of family relationships, about high alcohol consumption being related to the frequency of going out and about lower study times being related to higher alcohol consumption.

However, revisiting the hypotheses reveals a somewhat problematic wording and approach, focusing too much on the faults of those I have called heavy drinkers (those with a positive high_use value). It would perhaps be fairer to comment on the relationship between high alcohol consumption and background factors, and to avoid judgmental attitudes.

Below is the printout of the summary of the logistic regression model and the table with odds ratios and confidence intervals.

m <- glm(high_use ~ sex + famrel + goout + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733        0.08309 .  
## sexM          0.7959     0.2669   2.982        0.00286 ** 
## famrel       -0.4193     0.1399  -2.996        0.00273 ** 
## goout         0.7873     0.1230   6.399 0.000000000157 ***
## studytime    -0.4814     0.1711  -2.814        0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM        2.2165443 1.31841735 3.7630180
## famrel      0.6575181 0.49791884 0.8636137
## goout       2.1974324 1.73873662 2.8198312
## studytime   0.6179355 0.43751933 0.8576353

Thanks to the variable screening at the start of this exercise, all of the explanatory variables in the model turn out to be statistically significant, one at the 0.1 per cent level and three at the 1 per cent level. I can therefore explore the predictive power of the model as it stands.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to "alc"
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions. The numbers are the count of individuals.
table(high_use = alc$high_use, prediction = alc$prediction) 
##         prediction
## high_use FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52
# tabulate the target variable versus the predictions. The table shows the proportions.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.62162162 0.07837838 0.70000000
##    TRUE  0.15945946 0.14054054 0.30000000
##    Sum   0.78108108 0.21891892 1.00000000

Based on the data and the reported alcohol intake, exactly 30 per cent of respondents (111 of 370) were classified as heavy drinkers (high_use = TRUE), and 70 per cent (259 individuals) as non-heavy drinkers (high_use = FALSE). According to the model, 29 of the 259 moderately drinking individuals (11 per cent) were erroneously predicted to be heavy drinkers. Of the 111 heavy drinkers, a majority (59) were erroneously predicted not to be high users even though they were according to the data.

Although the model did fairly well in recognizing non-heavy users, it did worse at identifying the actual heavy drinkers (based on their reported intake). Overall, the model predicted 78 per cent of respondents to be non-heavy users, while the actual proportion was 70 per cent.
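
To summarize the table in a couple of figures, the overall accuracy and the sensitivity can be computed directly from the cross tabulation (a small sketch using the objects created above; the values in the comments follow from the counts shown above):

# overall accuracy: share of correct predictions (diagonal of the confusion matrix)
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
sum(diag(conf)) / sum(conf)                  # (230 + 52) / 370, about 0.76
# sensitivity: share of actual heavy drinkers that the model catches
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])   # 52 / 111, about 0.47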

The proportion of wrongly classified individuals is displayed visually in the plot below. As the visualization makes clear, the proportion of inaccurately classified individuals (i.e. the training error) is fairly high. The mean prediction error can be computed by defining a loss function and comparing classifications and probabilities. The calculation indicates that nearly 24 per cent are wrongly classified. I would not have expected the analysis to give such a high number. On the other hand, the model could become more accurate with a higher number of observations.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378

A note on the usefulness of the model

At a loss rate of 24 per cent, the model is not very accurate. Can I use the information? It depends on the purpose. For instance, if the goal were to target likely heavy drinkers with advertisements on social media (assuming that heavy drinkers would be more inclined to buy alcohol), the information could definitely be useful (if one wanted to target alcohol ads at minors…). For identifying who will develop an alcohol problem within five years and who would need intervention, the model is far too inaccurate. At this level of accuracy, one could perhaps only target information campaigns on the potential harms of drinking at an early age.

Bonus: Performing ten-fold cross-validation on the model

With ten-fold cross-validation, the prediction error is usually somewhat larger than the training error given by the loss function; in the run shown below, however, the two happen to coincide.

# Performing ten-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2378378
# calculating the difference between using ten-fold cross-validation and a loss function
cv$delta[1]-loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0

Super-Bonus: Performing cross-validation to compare the performance of different logistic regression models

Finally, I will perform cross-validation to compare the performance of logistic regression models with different sets of predictors. I start with the maximum number of predictors and explore the changes in the training and testing errors as variables are dropped one by one. The original dataset contained 33 variables, but since high_use is derived directly from Dalc, Walc and alc_use, these are excluded from the set of predictors.

In practice, I begin by creating a vector of numbers in decreasing order from 31 to 1, indicating how many variables are tested at a time. I then create two empty vectors in which I save the results of the computations. The three vectors are used to construct the resulting data frame.

I then write a for loop that, on each iteration, builds a formula with a given number of predictors and runs the cross-validation; the loop covers every number of predictors from 1 to 31. The results are saved in a data frame along with the number of variables used. The resulting table has three columns (number of variables, prediction error rate and training error rate) and 31 rows, one for every run.

Finally, I construct a plot showing how both the prediction and training errors decrease as the number of variables increases, with the prediction error always being greater. Looking at the resulting plot, I initially found it hard to digest the large variation, and especially the temporarily increasing trend in the prediction error. I suspect this may result from the varying number of statistically significant predictors included in the different models.

The downside of the fairly long code below is the time it takes to execute. I have found that running this last chunk alone - comparing models with up to 31 variables - takes about three minutes. This makes me reluctant to re-run the code to test small changes unless I reduce the number of variables to a handful. If you have any suggestions on how to improve the code and speed up the process, I will gladly hear them!

library(dplyr)
library(boot)
howmanyvar <- 31 #Enter here how many variables you want to test for. 31 is the maximum. (33 variables were included in the original dataset but two of these will always be excluded for computational feasibility and avoidance of near-perfect correlation.)
#create vector, in sequence, starting from the number above, descending by 1.
v <- rev(seq(1,howmanyvar))
#create empty numeric vectors of same length for the results.
trainingerrors <- integer(howmanyvar)
predictionerrors <- integer(howmanyvar)
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
for(i in v) {
# Within the loop, I first create a temporary subset called "alctest". I exclude
# "Dalc", "Walc" and "alc_use", which are directly related to "high_use" (and caused
# warnings in the logistic regression), as well as "probability" and "prediction",
# which were computed on the whole dataset and will be recomputed here.
exclvarnames <- names(alc) %in% c("Dalc","Walc", "alc_use", "probability", "prediction")
alctest <- alc[!exclvarnames]
# Keep the first v[i] explanatory columns; in alctest, column 32 is "high_use",
# so it is always included as well.
alctest <- dplyr::select(alctest, 1:v[i], 32)
# Exclude "high_use" from the vector of column names used on the right-hand side of the formula.
fnames <- names(alctest)[names(alctest) !="high_use"]
# create a formula where "high_use" is explained by the selected variables
f <- as.formula(paste("high_use~",paste(fnames,collapse="+")))
# run a logistic regression
m2 <- glm(f, data = alctest, family = "binomial")
# predict() the probability of high_use using model m2
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to "alctest"
alctest <- mutate(alctest, probability = probabilities)
# compute the average number of wrong predictions in the (training) data and save the result in vector
trainingerrors[i] <- loss_func(alctest$high_use,alctest$probability)
# leave-one-out cross-validation (K equals the number of observations), which is what makes this chunk slow
cv <- cv.glm(data = alctest, cost = loss_func, glmfit = m2, K = nrow(alctest))
# compute the average number of wrong predictions in the cross validation and save the result in vector
predictionerrors[i] <- cv$delta[1]
}
results <- data.frame(variables=v, trainingerrors=trainingerrors, predictionerrors=predictionerrors)
results
##    variables trainingerrors predictionerrors
## 1         31      0.1918919        0.2567568
## 2         30      0.1891892        0.2621622
## 3         29      0.1945946        0.2540541
## 4         28      0.1972973        0.2540541
## 5         27      0.2162162        0.2783784
## 6         26      0.2108108        0.2621622
## 7         25      0.2162162        0.2702703
## 8         24      0.2162162        0.2729730
## 9         23      0.2675676        0.3162162
## 10        22      0.2594595        0.3270270
## 11        21      0.2729730        0.3216216
## 12        20      0.2675676        0.3297297
## 13        19      0.2702703        0.3297297
## 14        18      0.2783784        0.3270270
## 15        17      0.2810811        0.3162162
## 16        16      0.2783784        0.3162162
## 17        15      0.2810811        0.3081081
## 18        14      0.2810811        0.3081081
## 19        13      0.2729730        0.3135135
## 20        12      0.2783784        0.3162162
## 21        11      0.2729730        0.3324324
## 22        10      0.2918919        0.3189189
## 23         9      0.2864865        0.3108108
## 24         8      0.2918919        0.3162162
## 25         7      0.2945946        0.3054054
## 26         6      0.2945946        0.3054054
## 27         5      0.2918919        0.3027027
## 28         4      0.2972973        0.3135135
## 29         3      0.3054054        0.3135135
## 30         2      0.3000000        0.3000000
## 31         1      0.3000000        0.3000000
p <- ggplot(results, aes(x=variables)) + geom_line(aes(y=predictionerrors, color="prediction")) + geom_line(aes(y=trainingerrors, color="training"))
p + ggtitle("Relation between error rates\n and number of variables") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Number of variables") + ylab("Error rate") + scale_color_discrete(name="Type of error")


RStudio Exercise 4

date()
## [1] "Sun Nov 28 23:32:33 2021"

The Boston dataset - a brief description

The Boston dataset, included in the MASS package, contains 506 observations of 14 variables relating to housing in the Boston area. Each observation describes a Boston suburb or town. The variables cover information such as crime rate, air pollution, ethnic composition, the proportion of land zoned for large lots or industry, taxation, distances to employment centres, highway accessibility and housing values. As has been stated elsewhere, the dataset has become a much-used pedagogical tool for teaching regression analysis and machine learning. First, let us swiftly explore the contents!

# accessing the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# loading the Boston data
data("Boston")
# typing ?Boston in the R console will open the documentation on the variables. A description can also be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
?Boston
# a summary of the content of the variables is given below
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

It is hard to form an understanding of the relationships between the 14 variables simply by plotting all the pairs. Drawing a corrplot diagram gives a quick overview of the correlations between variables. The larger the circle, the stronger the correlation; red indicates negative and blue positive correlation.

# plotting the relation between variables with corrplot
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# rounding values
cor_matrix<-cor(Boston) %>% round(2)
# drawing a corrplot
corrplot(cor_matrix, method="circle", type = "upper", tl.cex = 0.6, tl.pos = "d")

# contrary to the instructions provided on Data Camp, I choose to exclude the 'cl.pos = "b"' command, as I prefer to have the color legend vertically on the right rather than horizontally below the figure.

What sticks out from the visualization above is the strong negative correlation between the distance to employment centres (dis) and the proportion of old houses (age), the nitrogen oxides concentration in the air (nox) and the proportion of non-retail business land (indus). That is, the further an area lies from the employment centres, the newer its housing stock, the lower its share of non-retail business land, and the lower the concentration of nitrogen oxides in the air.

Similarly, there is a strong positive correlation between accessibility to highways (rad) and property-tax rate (tax). That is, the better the access to highways, the higher the property tax rate.
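
As a quick numerical check of these observations (a sketch, to be run before the data is standardized), the pairwise correlations can be printed directly; they are roughly -0.75, -0.71, -0.77 and 0.91 respectively:

# pairwise correlations behind the observations above
cor(Boston$dis, Boston$age)
cor(Boston$dis, Boston$indus)
cor(Boston$dis, Boston$nox)
cor(Boston$rad, Boston$tax)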

In the following section, I standardize the variables, which makes them easier to compare and to use in further operations. (Ideally, I would also explore the differences between the standardized and non-standardized variables graphically, for instance with ggplot bar graphs, but as this was not specifically required, I take it to be beyond the scope of this exercise.) For now, it suffices to print out a summary of the variables in the standardized set.

Comparing the summaries shows that the range of several variables has decreased considerably (e.g. crim, zn, indus). In fact, the maximum value of crim (9.92) is the largest value among all the standardized variables. As all variables are now centered around a mean of zero, the minimum values of all variables have become negative.

# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summarize the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
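
For reference, scale() standardizes each column as (x - mean(x)) / sd(x). A quick manual check on the crim column (a sketch using the objects above) should reproduce the first values of boston_scaled$crim:

# manual column-wise standardization of crim as a sanity check
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
head(crim_std)
head(boston_scaled$crim)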

As instructed, I continue by creating a categorical variable for the crime rate using the quantiles as break points. I name the categories from low to high. I then drop the old crime rate variable from the boston_scaled dataset.

# create a quantile vector of crim to be used as break points
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
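
As a quick check (a sketch using the objects created above), tabulating the new variable should show four categories of roughly equal size, since the quartiles were used as break points:

# the quartile-based breaks should give four groups of about 126-127 towns each
table(boston_scaled$crime)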

Moreover, I divide the dataset into train and test sets so that 80% of the data belongs to the train set. This is done in preparation for fitting a linear discriminant analysis model on the train set.

# dividing the dataset into train and test sets, so that 80% of the data belongs to the train set. I randomly assign 80 per cent of the rows in the Boston dataset to the train set.
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

Now that I have created the train and test sets, I continue by fitting the linear discriminant analysis on the train set. I choose the categorical crime rate (crime) as the target variable and all the other variables as predictors. I will also plot the LDA fit. The chunk below includes code defining a function that adds arrows to the plot.

# fitting the linear discriminant analysis on the train set using the categorical crime rate as target variable and all other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
# this function can be used for adding arrows to the biplot that will be plotted next
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the lda results with arrows added
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

The plot above indicates the direction in which each variable pulls the results. The longest arrow clearly belongs to the rad variable, with zn and nox also pulling the results fairly strongly in different directions.

Continuing to follow the given instructions, I save the crime categories from the test data before removing the crime variable. This allows me to evaluate the correctness of predictions when using the test data to predict crime classifications. The cross tabulation of predictions and correct results is given below.

The results of the cross tabulation are interesting, especially when looking at the four corners. None of the areas where the crime rate was predicted to be low actually had a high crime rate, and vice versa. Similarly, where crime rates were predicted to be high they actually were high, and where they were predicted to be low they chiefly were low. Some variation occurred nevertheless in how accurate the predictions were. For instance, of the areas that actually had low rates, only 19 of 31 were correctly classified (9 were classified as med_low and 3 as med_high). The model did a clearly better job with the high-crime areas (18 of 19 correct) than with the low-crime ones.

# saving crime categories from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulating the results with the crime categories from the test set (removed from the test set)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19       9        3    0
##   med_low    6      18        3    0
##   med_high   1       8       15    1
##   high       0       0        1   18
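
As a rough summary of the table, the share of correctly classified test observations can be computed from the diagonal of the cross tabulation (a sketch using the objects above; for the run shown here this gives 70/102, about 0.69, but the exact figure varies between runs because the train/test split is random):

# share of correct predictions: diagonal of the confusion matrix over all test observations
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)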

I will now execute the last step of the exercise proper. I begin by reloading the Boston dataset and standardizing the variables. I prepare a Euclidean distance matrix to calculate the distances between observations, and then run the k-means algorithm on the dataset to cluster the data. The clusters generated by k-means are then plotted.

# reloading the Boston dataset
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- scale(Boston)
# preparing a euclidean distance matrix
dist_eu <- dist(Boston)
# running a k-means algorithm on the dataset
km <-kmeans(Boston, centers = 3)
# plotting the results
pairs(Boston, col = km$cluster)

As can be seen, the chart above looks really busy. We can investigate the optimal number of clusters by plotting the total within-cluster sum of squares for k-means runs with the number of clusters ranging from 1 to 10. The result is given below.

The optimal number of clusters is indicated by where the curve drops sharply. In this case, there is no self-evident answer. In my interpretation, the drop is sharpest at two cluster centers, after which the decline slows down, so I find it reasonable to cluster with two centers. A new plot is printed below.

# calculating the total within sum of squares (up to 10 clusters)
set.seed(123) # set the random seed so that the k-means results are reproducible
twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
# visualizing the results
library(ggplot2)
qplot(x = 1:10, y = twcss, geom = 'line') + scale_x_continuous(breaks = c(2,4,6,8,10), limits=c(1,10))

# running again the k-means algorithm on the dataset with the newly determined optimal number of clusters
km <-kmeans(Boston, centers = 2)
# plotting the results with pairs
pairs(Boston, col = km$cluster)

For the bonus section, I run the k-means algorithm on the original (standardized) Boston data with 3 cluster centers. An LDA is then performed with the clusters as target classes. The biplot with arrows is given below. It appears that zn, nox, medv and tax are the most influential linear separators for the clusters. That is, the proportion of residential land zoned for large lots, air pollution, housing values and the property tax rate seem to be the variables most strongly influencing which cluster a particular area belongs to.

# reloading the Boston dataset once more
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- as.data.frame(scale(Boston))
# performing k-means on the dataset
km <-kmeans(Boston, centers = 3)
# saving clusters to be used with the LDA
clusters <- km$cluster
# adding the new clusters variable to the set
Boston <- data.frame(Boston, clusters)
# dividing the dataset into train and test sets to be used with the LDA
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
train2 <- Boston[ind,]
# removing chas because I repeatedly received the error message "variable 4 appears to be constant within groups"
train2 <- dplyr::select(train2, -chas)
# performing the lda
lda.fit2 <- lda(clusters ~ ., data = train2)
# defining the arrows function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# plotting the lda results with arrows
plot(lda.fit2, dimen = 2, col=clusters, pch=clusters, ylim=c(-5,7),xlim=c(-5,5))
lda.arrows(lda.fit2, myscale = 4)

For the super-bonus section, I apply the given code to create a matrix product, a projection of the data points that will be visualized. I draw two 3D plots, one where the color is defined by the categorical crime classes and another where it is defined by the k-means clusters. As can be seen, the shape of the two plots is identical. However, the colouring by the categorical crime classes is much ‘neater’, helping the eye to see which elements belong together; for instance, all the yellow points are gathered on one side of the chart. Turning and twisting the graphic with the mouse helps in understanding the data even better.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# accessing the plotly package in order to create a 3D plot of the columns of the matrix product. (I have run install.packages("plotly") once already.)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~classes)
# extracting the clusters from the second train set
cluscol <- train2$clusters
# drawing the second plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~cluscol)